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Abstract 

This  paper  describes  FFTW,  a  portable  C  package  for  computing  the  one-  and  multidimen¬ 
sional  complex  discrete  Fourier  transform  (DFT).  FFTW  is  typically  faster  than  all  other 
publicly  available  DFT  software,  including  the  well-known  FFTPACK  and  the  code  from 
Numerical  Recipes.  More  interestingly,  FFTW  is  competitive  with  or  better  than  propri¬ 
etary,  highly-tuned  codes  such  as  Sun’s  Performance  Library  and  IBM’s  ESSL  library.  FFTW 
implements  the  Cooley-Tukey  fast  Fourier  transform,  and  is  freely  available  on  the  Web  at 
http : / /theory .lcs.mit.edu/~fftw. 

Three  main  ideas  are  the  keys  to  FFTW’s  performance.  First,  the  computation  of  the 
transform  is  performed  by  an  executor  consisting  of  highly-optimized,  composable  blocks 
of  C  code  called  codelets.  Second,  at  runtime,  a  planner  finds  an  efficient  way  (called  a 
‘plan’)  to  compose  the  codelets.  Through  the  planner,  FFTW  adapts  itself  to  the  architecture 
of  the  machine  it  is  running  on.  Third,  the  codelets  are  automatically  generated  by  a  codelet 
generator  written  in  the  Caml  Light  dialect  of  ML.  The  codelet  generator  produces  long, 
optimized,  unreadable  code,  which  is  nevertheless  easy  to  modify  via  simple  changes  to  the 
generator. 

Keywords:  Fast  Fourier  transform,  high  performance,  ML,  code  generation. 


1  Introduction 


This  paper  describes  FFTW,  a  portable  C  package  for  computing  the  one-  and  multidimen¬ 
sional  complex  discrete  Fourier  transform  (DFT).  Extensive  benchmarking  demonstrates 
that  FFTW  is  typically  faster  than  all  other  publicly  available  DFT  software,  including  the 
well-known  FFTPACK  [1]  and  the  code  from  Numerical  Recipes  [2].  More  interestingly, 
FFTW  is  competitive  with  or  better  than  proprietary,  highly-tuned  codes  such  as  Sun’s  Per¬ 
formance  Library  and  IBM’s  ESSL  library.  FFTW  is  an  implementation  of  the  Cooley- 
Tukey  [3]  fast  Fourier  transform  (FFT),  and  is  freely  available  on  the  World  Wide  Web  at 
the  URL  http :  /  / theory  .  lcs  .  mit .  edu/  ~  f  ftw. 

Three  main  ideas  are  the  keys  to  FFTW’s  performance.  First,  the  computation  of  the 
transform  is  performed  by  an  executor  consisting  of  highly-optimized,  composable  blocks 
of  C  code  called  codelets.  Second,  at  runtime,  a  planner  finds  an  efficient  way  (called  a  plan ) 
to  compose  the  codelets.  Through  the  planner,  FFTW  adapts  itself  to  the  architecture  of  the 
machine  it  is  running  on.  In  this  way,  FFTW  is  a  single  program  that  performs  efficiently 
on  a  variety  of  architectures.  Third,  the  codelets  are  automatically  generated  by  a  codelet 
generator  written  in  the  Caml  Light  dialect  of  ML  [4] .  The  codelet  generator  produces  long, 
optimized,  unreadable  code,  which  is  nevertheless  easy  to  modify  via  simple  changes  to  the 
generator. 

Despite  its  internal  complexity,  FFTW  is  easy  to  use.  (See  Figure  1.)  The  user  interacts 
with  FFTW  only  through  the  planner  and  the  executor;  the  codelet  generator  is  not  used  after 
compile-time.  FFTW  provides  a  function  that  creates  a  plan  for  a  transform  of  a  certain  size. 
Once  the  user  has  created  a  plan,  she  can  use  the  plan  as  many  times  as  needed.  FFTW  is 
not  restricted  to  transforms  whose  size  is  a  power  of  2.  A  parallel  version  of  the  executor, 
written  in  Cilk  [5],  also  exists. 

The  executor  implements  the  well-known  Cooley-Tukey  algorithm  [3],  which  works  by 
factoring  the  size  N  of  the  transform  into  N  =  Ni  N2 .  The  algorithm  then  recursively  com¬ 
putes  .Y|  transforms  of  size  A2  and  A2  transforms  of  size  A] .  The  base  case  of  the  recursion 
is  handled  by  the  codelets,  which  are  hard-coded  transforms  for  various  small  sizes.  We  em¬ 
phasize  that  the  executor  works  by  explicit  recursion,  in  sharp  contrast  with  the  traditional 
loop-based  implementations  [6,  page  608].  The  recursive  divide-and-conquer  approach  is 
superior  on  modern  machines,  because  it  exploits  all  levels  of  the  memory  hierarchy:  as 
soon  as  a  subproblem  fits  into  cache,  no  further  cache  misses  are  needed  in  order  to  solve 
that  subproblem.  Our  results  contradict  the  folk  theorem  that  recursion  is  slow.  Moreover, 
the  divide-and-conquer  approach  parallelizes  naturally  in  Cilk. 

The  Cooley-Tukey  algorithm  allows  arbitrary  choices  for  the  factors  Ni  and  A2  of  N. 
The  best  choice  depends  upon  hardware  details  like  the  number  of  registers,  latency  and 
throughput  of  instructions,  size  and  associativity  of  caches,  structure  of  the  processor  pipeline, 
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fftw_plan  plan; 
int  n  =  1024; 

COMPLEX  A [n] ,  B [n] ; 

/*  plan  the  computation  */ 
plan  =  f ftw_create_plan (n) ; 

/*  execute  the  plan  */ 
f ftw (plan.  A) ; 

/*  the  plan  can  be  reused  for  other  inputs 
of  size  n  */ 
f ftw (plan,  B) ; 

Figure  1:  Simplified  example  of  FFTW’s  use.  The  user  must  first  create  a  plan,  which  can 
be  then  used  at  will.  In  the  actual  code,  there  are  a  few  other  parameters  that  specify  the 
direction,  dimensionality,  and  other  details  of  the  transform. 


etc.  Most  high-performance  codes  are  tuned  for  a  particular  set  of  these  parameters.  In  con¬ 
trast,  FFTW  is  capable  of  optimizing  itself  at  runtime  through  the  planner,  and  therefore  the 
same  code  can  achieve  good  performance  on  many  architectures.  We  can  imagine  the  plan¬ 
ner  as  trying  all  factorizations  of  N  supported  by  the  available  codelets,  measuring  their  exe¬ 
cution  times,  and  selecting  the  best.  In  practice,  the  number  of  possible  plans  is  too  large  for 
an  exhaustive  search.  In  order  to  prune  the  search,  we  assume  that  the  optimal  solution  for 
a  problem  of  size  N  is  still  optimal  when  used  as  a  subroutine  of  a  larger  problem.  With  this 
assumption,  the  planner  can  use  a  dynamic-programming  [7,  chapter  16]  algorithm  to  find 
a  solution  that,  while  not  optimal,  is  sufficiently  good  for  practical  purposes.  The  solution 
is  expressed  in  the  form  of  byte-code  that  can  be  interpreted  by  the  executor  with  negligible 
overhead.  Our  results  contradict  the  folk  theorem  that  byte-code  is  slow. 

The  codelet  generator  produces  C  subroutines  (codelets)  that  compute  the  transform  of 
a  given  (small)  size.  Internally,  the  generator  itself  implements  the  Cooley-Tukey  algorithm 
in  symbolic  arithmetic,  the  result  of  which  is  then  simplified  and  unparsed  to  C.  The  simpli¬ 
fication  phase  applies  to  the  code  many  transformations  that  an  experienced  hacker  would 
perform  by  hand.  The  advantage  of  generating  codelets  in  this  way  is  twofold.  First,  we  can 
use  much  higher  radices  than  are  practical  to  implement  manually  (for  example,  radix-32 
tends  to  be  faster  than  smaller  radices  on  RISC  processors).  Second,  we  can  easily  exper¬ 
iment  with  diverse  optimizations  and  algorithmic  variations.  For  example,  it  was  easy  to 
add  support  for  prime  factor  FFT  algorithms  (see  [8]  and  [6,  page  619])  within  the  codelets. 
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A  huge  body  of  previous  work  on  the  Fourier  transform  exists,  which  we  do  not  have 
space  to  reference  properly.  We  limit  ourselves  to  mention  some  references  that  are  impor¬ 
tant  to  the  present  paper.  A  good  tutorial  on  the  FFT  can  be  found  in  [9]  or  in  classical  text¬ 
books  such  as  [6].  Previous  work  exists  on  automatic  generation  of  FFT  programs:  [10]  de¬ 
scribes  the  generation  of  FFT  programs  for  prime  sizes,  and  [11]  presents  a  generator  of 
Pascal  programs  implementing  a  Prime  Factor  algorithm.  Johnson  and  Burrus  [12]  first  ap¬ 
plied  dynamic  programming  to  the  design  of  optimal  DFT  modules.  Although  these  papers 
all  deal  with  the  arithmetic  complexity  of  the  FFT,  we  are  not  aware  of  previous  work  where 
these  techniques  are  used  to  maximize  the  actual  performance  of  a  program.  The  behavior 
of  the  FFT  in  the  presence  of  nonuniform  memory  was  first  studied  by  [13].  Savage  [14] 
gives  an  asymptotically  optimal  strategy  for  minimizing  the  memory  traffic  of  the  FFT  un¬ 
der  very  general  conditions.  Our  divide-and-conquer  strategy  is  similiar  in  spirit  to  Sav¬ 
age’s  approach.  The  details  of  our  implementation  are  asymptotically  suboptimal  but  faster 
in  practice.  Some  other  theoretical  evidence  in  support  of  recursive  divide-and-conquer  al¬ 
gorithms  for  improving  locality  can  be  found  in  [15]. 

In  this  short  paper  we  do  not  have  space  to  give  more  details  about  the  planner  and  the 
executor.  Instead,  in  Section  2  we  present  performance  comparisons  between  FFTW  and 
various  other  programs.  In  Section  3,  we  discuss  the  codelet  generator  and  its  optimization 
strategy.  Finally,  in  Section  4  we  give  some  concluding  remarks. 


2  Performance  results 

In  this  section,  we  present  the  result  of  benchmarking  FFTW  against  many  public-domain 
and  a  few  proprietary  codes.  From  the  results  of  our  benchmark,  FFTW  appears  to  be  the 
fastest  portable  FFT  implementation  for  almost  all  transform  sizes.  Indeed,  its  performance 
is  competitive  with  that  of  the  vendor-optimized  Sun  Performance  and  ESSL  libraries  on 
the  UltraSPARC  and  the  RS/6000,  respectively.  At  the  end  of  the  section  we  describe  our 
benchmark  methodology. 

It  is  impossible  to  include  the  benchmark  results  for  all  machines  here.  Instead,  we 
present  the  data  from  two  machines:  a  167-MHz  Sun  UltraSPARC-1  and  an  IBM  RS/6000 
Model  3BT  (Figures  2  through  6).  Data  from  other  machines  are  similar  and  can  be  found 
on  our  web  site,  as  well  as  results  for  transforms  whose  size  is  not  a  power  of  2.  The  perfor¬ 
mance  results  are  given  as  a  graph  of  the  speed  of  the  transform  in  MFLOPS  versus  array 
size  for  both  one  and  three  dimensional  transforms.  The  MFLOPS  count  is  computed  by 
postulating  the  number  of  floating  point  operations  to  be  5  N  log2  N,  where  N  is  the  number 
of  complex  values  being  transformed  (see  [16,  page  23]  and  [17,  page  45]).  This  metric  is 
imprecise  because  it  refers  only  to  radix-2  Cooley-Tukey  algorithms.  Nonetheless,  it  allows 
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our  numbers  to  be  compared  with  other  results  in  the  literature  [  1  ] .  Except  where  otherwise 
noted,  all  benchmarks  were  performed  in  double  precision.  A  complete  listing  of  the  FFT 
implementations  included  in  the  benchmark  is  given  in  Table  1. 

Figure  2  shows  the  results  on  a  167MHz  UltraSPARC-1.  FFTW  outperforms  the  Sun 
Performance  Fibrary  for  large  transforms  in  double  precision,  although  Sun’s  software  is 
faster  for  sizes  between  128  and  2048.  In  single  precision  (Figure  4)  FFTW  is  superior  over 
the  entire  range.  On  the  RS/6000  FFTW  is  always  comparable  or  faster  than  IBM’s  ESSF 
library,  as  shown  in  Figures  3  and  6.  The  high  priority  that  was  given  to  memory  locality  in 
FFTW’s  design  is  evident  in  the  benchmark  results  for  large,  one-dimensional  transforms, 
for  which  the  cache  size  is  exceeded.  Especially  dramatic  is  the  factor  of  three  contrast  on 
the  RS/6000  (Figure  3)  between  FFTW  and  most  of  the  other  codes  (with  the  exception  of 
CWP,  discussed  below,  and  ESSF,  which  is  optimized  for  this  machine).  This  trend  also 
appeared  on  most  of  the  other  hardware  that  we  benchmarked. 

A  notable  program  is  the  one  labelled  ‘CWP’  in  the  graphs,  which  sometimes  surpasses 
the  speed  of  FFTW  for  large  transforms.  Unlike  all  other  programs  we  tried,  CWP  uses  a 
prime-factor  algorithm  [18,  19]  instead  of  the  Cooley-Tukey  FFT.  CWP  works  only  on  a 
restricted  set  of  transform  sizes.  Consequently,  the  benchmark  actually  times  it  for  a  trans¬ 
form  whose  size  (chosen  by  CWP)  is  slightly  larger  than  that  used  by  the  rest  of  the  codes. 
We  chose  to  include  it  on  the  graph  since,  for  many  applications,  the  exact  size  of  the  trans¬ 
form  is  unimportant.  The  reader  should  be  aware  that  the  point-to-point  comparison  of  CWP 
with  other  codes  may  be  meaningless:  CWP  is  solving  a  bigger  problem  and,  on  the  other 
hand,  it  is  choosing  a  problem  size  it  can  solve  efficiently. 

The  results  of  a  particular  benchmark  run  were  never  entirely  reproducible.  Usually,  the 
differences  from  run  to  run  were  5%  or  less,  but  small  changes  in  the  benchmark  could  pro¬ 
duce  much  larger  variations  in  performance,  which  proved  to  be  very  sensitive  to  the  align¬ 
ment  of  code  and  data  in  memory.  We  were  able  to  produce  changes  of  up  to  10%  in  the 
benchmark  results  by  playing  with  the  data  alignment  (e.g.  by  adding  small  integers  to  the 
array  sizes).  More  alarmingly,  changes  to  a  single  line  of  code  of  one  FFT  could  occasion¬ 
ally  affect  the  performance  of  another  FFT  by  more  than  10%.  The  most  egregious  offender 
in  this  respect  was  one  of  our  Pentium  Pro  machines  running  Finux  2.0. 17  and  the  gcc  2.7.2 
compiler.  On  this  machine,  the  insertion  of  a  single  line  of  code  into  FFTW  slowed  down 
a  completely  unrelated  FFT  (CWP)  by  almost  a  factor  of  twenty.  Consequently,  we  do  not 
dare  to  publish  any  data  from  this  machine.  We  do  not  completely  understand  why  the  per¬ 
formance  Pentium  Pro  varies  so  dramatically.  Nonetheless,  on  the  other  machines  we  tried, 
the  overall  trends  are  consistent  enough  to  give  us  confidence  in  the  qualitative  results  of 
the  benchmarks. 

Our  benchmark  program  works  as  follows.  The  benchmark  uses  each  FFT  subroutine 
to  compute  the  DFT  many  times,  measures  the  elapsed  time,  and  divides  by  the  number 
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Figure  2:  Comparison  of  ID  FFTs  on  a  Sun  FIPC  5000  (167MFIz  UltraSPARC-1).  Compiled 
with  cc  -native  -fast  -x05  -  da  lign.  SunOS  5.5.1,  cc  version  4.0. 
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Figure  3:  Comparison  of  ID  FFTs  on  an  IBM  RS/6000,  model  3BT,  running  AIX  version 
3.2.  Compiled  with  cc  -03  -qarch=pwrx  -qtune=pwrx. 
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Figure  4:  Comparison  of  FFTW  with  the  Sun  Performance  Library  on  the  UltraSPARC  for 
single  precision  ID  transforms. 
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Figure  5:  Comparison  of  3D  FFTs  on  the  UltraSPARC. 
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Transform  Size 


Figure  6:  Comparison  of  3D  transforms  from  FFTW  and  the  ESSL  library  on  the  RS/6000, 
in  single  precision. 
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The  Fastest  Fourier  Transform  in  the  West 

SUNPERF 

ID  FFT  from  the  Sun  Performance  Library  (UltraSPARC  version) 

*ESSL 

ID  and  3D  FFTs  from  IBM’s  ESSL  library  for  the  RS/6000.  Only  the  single 
precision  version  of  the  3D  transform  was  available  to  us. 

CWP 

A  prime-factor  FFT  implementation  by  D.  Hale  in  a  C  numerical  library 
from  the  Colorado  School  of  Mines. 

Krukar 

ID  C  FFT  by  R.  H.  Krukar. 

Nielson 

Mixed-radix,  C  FFT  by  J.  J.  Nielson. 

Singleton 

Mixed-radix,  multidimensional,  Fortran  FFT  by  R.  C.  Singleton  [20], 

FFTPACK 

Fortran  ID  FFT  library  by  P.  N.  Swarztrauber  [1], 

PDA 

3D  FFT  from  the  Public  Domain  Algorithms  library.  Uses  FFTPACK  for 
its  ID  FFTs. 

NR 

C  FFTs  in  one  or  more  dimensions  from  Numerical  Recipes  in  C  [2], 

Temperton 

Fortran  FFT  in  one  and  three  dimensions  by  C.  Temperton  [21]. 

NAPACK 

Fortran  FFT  from  the  free  NAPACK  library. 

Mayer 

ID  C  FFT  by  R.  Mayer. 

Edelblute 

ID  C  FFT  by  D.  Edelblute  and  R.  Mayer. 

Beauregard 

ID  C  FFT  by  G.  Beauregard. 

HARMD 

3D  Fortran  FFT,  author  unknown. 

Table  1 :  Description  of  the  programs  benchmarked.  All  codes  are  generally  available  except 
for  the  entries  marked  with  an  asterisk,  which  are  proprietary  codes  optimized  for  particular 
machines. 
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of  iterations  to  get  the  time  required  for  a  single  transform.  After  each  FFT,  however,  it 
is  necessary  to  reinitialize  the  array  being  transformed  (iteration  of  the  DFT  is  a  diverging 
process).  The  time  for  these  reinitializations  is  measured  separately  and  subtracted  from  the 
elapsed  time  mentioned  above. 

Instead  of  reinitializing  the  input  array  after  each  transform,  one  could  alternatively  fol¬ 
low  the  transform  by  the  inverse  transform  in  every  iteration.  Many  FFT  implementations 
compute  an  unnormalized  DFT,  however,  and  thus  it  would  be  necessary  to  have  an  addi¬ 
tional  loop  to  scale  the  data  properly.  We  did  not  want  to  measure  this  additional  cost,  since 
in  many  applications  the  scale  factor  can  easily  be  absorbed  into  other  multiplicative  factors 
to  avoid  the  extra  multiplications  and  loads. 

It  was  our  intention  to  benchmark  C  FFTs,  but  much  of  the  public-domain  software  was 
written  in  Fortran.  These  codes  were  converted  to  C  via  the  free  f  2  c  software  [22].  This 
raises  some  legitimate  concerns  about  the  quality  of  the  automatic  translation  performed 
by  f  2  c,  as  well  as  the  relative  quality  of  Fortran  and  C  compilers.  Accordingly,  we  com¬ 
pared  the  original  Fortran  FFTPACK  with  the  f  2c  version.  On  average,  the  C  code  was 
16%  slower  than  the  Fortran  version  on  the  RS/6000  and  27%  slower  on  the  UltraSPARC. 
The  Fortran  code  was  never  faster  than  FFTW.  We  will  give  more  results  for  native  Fortran 
software  in  the  final  paper. 


3  The  codelet  generator 

In  this  section  we  describe  the  codelet  generator,  that  produces  optimized  C  codelets.  It  is 
written  in  Caml  Light  [4]  because  it  is  easy  to  express  symbolic  manipulations  in  that  lan¬ 
guage.  Because  of  this  automatic  generation  process,  it  is  easy  to  produce  and  experiment 
with  long  straight-line  code.  The  generator  performs  many  optimizations  such  as  constant 
folding  and  minus-sign  propagation. 

The  codelet  generator  accepts  as  input  an  integer  n,  and  produces  a  fragment  of  C  code 
that  computes  the  Fourier  transform  of  size  n  (a  codelet).  Depending  on  the  options  with 
which  the  generator  is  invoked,  the  codelet  computes  either  the  forward  or  the  backward 
transform,  and  can  optionally  multiply  its  input  by  a  precomputed  set  of  twiddle  factors. 
The  codelet  generator  is  written  in  the  Caml  Light  dialect  of  ML  [4] .  Caml  is  an  applicative, 
polymorphic,  and  strongly-typed  functional  language  with  first-class  functions,  algebraic 
data  types,  and  pattern  matching. 

The  codelet  generator  operates  on  a  subset  of  C’s  abstract  syntax  tree  (AST).  It  first  pro¬ 
duces  an  AST  for  a  naive  program  that  computes  the  transform,  and  then  applies  local  opti¬ 
mizations  to  the  AST  in  order  to  improve  the  program.  Finally,  it  unparses  the  AST  to  pro¬ 
duce  the  desired  C  code.  The  interface  between  the  first  phase  (generation  of  the  AST)  and 
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let  fftgen_prime  N  input  output  = 

let  expr  k  =  (Sigma  0  N  (fun  n  -> 

let  coeff  =  CExp (k  *  n  /  N) 
in  CTimes  coeff  (input  n) ) ) 
and  FFT  = 

forall  0  N  (fun  k  -> 

(output  k  (expr  k) ) ) 
in  FFT  ;  ; 

Figure  7:  Caml  program  that  generates  the  AST  of  a  transform  of  size  N,  when  N  is  prime. 
The  A;th  output  (denoted  by  (expr  k )  in  the  program)  is  the  sum  (Sigma)  for  n  ranging 
from  0  to  N  —  1  of  the  nth  input  multiplied  by  a  certain  coefficient  (coeff).  The  AST  FFT 
contains  a  program  that  outputs  the  /rth  element  of  the  transform,  for  all  k  ranging  from  0 
to  N  —  1 .  The  operator  forall  concatenates  many  ASTs  into  one. 

the  following  phases  is  such  that  the  AST  can  be  expressed  in  terms  of  complex  arithmetic, 
and  the  reduction  to  an  AST  that  uses  only  real  arithmetic  is  performed  automatically. 

In  the  rest  of  the  section,  we  shall  describe  the  AST  generation  phase  and  the  optimizer. 
The  unparser  is  rather  uninteresting,  and  we  will  not  bother  to  describe  it. 

The  AST  generation  phase  creates  a  crude  AST  for  the  desired  codelet.  This  AST  con¬ 
tains  a  lot  of  useless  code,  such  as  multiplications  by  0  and  1,  but  the  code  is  polished  by  the 
following  optimization  phase.  The  AST  generator  uses  the  Cooley-Tukey  algorithm  [3]  in 
the  form  presented  by  [6,  page  611].  We  assume  that  the  reader  is  familiar  with  this  well- 
known  algorithm.  The  actual  implementation  of  the  AST  generator  consists  of  about  60 
lines  of  code.  With  20  additional  lines  of  code  our  generator  can  also  produce  an  AST  for 
the  Prime  Factor  algorithm  [8]  as  described  in  [6,  page  619]. 

Recall  that  the  Cooley-Tukey  algorithm  reduces  a  transform  of  size  N  =  NtN2  to  A] 
transforms  of  size  N-2 ,  followed  by  some  multiplications  by  certain  complex  constants  called 
twiddle  factors,  followed  by  A2  transforms  of  size  A]  .  If  A  is  prime,  the  algorithm  com¬ 
putes  the  transform  directly  according  to  the  definition.  The  AST  generator  is  an  almost 
literal  transcription  of  this  algorithm.  It  consists  of  a  recursive  function  genf  ft  that  takes 
three  arguments:  the  size  N  of  the  transform  and  two  functions  input  and  output.  When 
applied  to  an  integer  n,  the  input  function  returns  a  complex  expression  that  contains  the 
nth  input  value.  (A  complex  expression  is  a  pair  of  ASTs  representing  the  real  and  imagi¬ 
nary  parts  of  the  input  in  symbolic  form.)  Similarly,  the  function  output  can  be  applied  to 
two  arguments  k  and  x  and  returns  an  AST  that  stores  the  expression  x  into  the  kth  output 
variable.  Because  of  lack  of  space,  we  do  not  show  the  recursive  part  of  f  ft  gen,  but  we 
do  show  the  base  case  of  the  recursion  in  Figure  7. 
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All  factors  Ni  and  A2  of  N  may  seem  equivalent,  but  some  factors  are  more  equivalent 
than  others.  One  natural  choice  is  to  let  Ni  be  the  smallest  prime  factor  of  N,  but  it  turns  out 
that  if  .Y|  is  a  factor  of  N  close  to  y/N  the  resulting  codelet  is  faster.  One  reason  is  that  the 
codelet  performs  fewer  arithmetic  operations  (although  we  do  not  fully  understand  why). 
Another  reason  is  that  the  codelet  is  more  likely  to  take  advantage  of  the  large  register  sets 
of  modern  superscalar  processors,  as  we  shall  now  illustrate  with  an  example.  Suppose  that 
a  transform  of  size  16  is  desired,  but  the  processor  can  only  compute  a  transform  of  size  4 
using  internal  registers.  If  we  choose  Ni  =  N-2  =  4,  than  the  processor  can  load  the  input 
once  from  memory,  compute  4  transforms  of  size  4  storing  the  result  back  into  memory, 
and  then  do  the  same  thing  again.  In  total,  the  input  is  read  twice.  It  is  easy  to  see  that  if  we 
let  Ni  =  2,  we  force  the  processor  to  read  the  input  array  three  times.  Within  our  codelet 
generator,  this  trick  could  be  implemented  in  just  a  few  minutes. 

We  now  describe  the  optimizer.  The  goal  of  the  optimizer  is  to  transform  a  raw  AST 
into  an  equivalent  one  that  executes  much  faster.  The  optimizer  consists  of  a  set  of  rules 
that  are  applied  locally  to  all  nodes  of  an  AST.  Most  of  the  rules  are  pretty  obvious,  such  as 
“a  +  0  =>  a”  and  the  like,  but  some  rules  are  far  more  subtle.  We  now  give  an  example  of 
how  the  rules  are  implemented  in  the  actual  codelet  generator,  and  then  we  discuss  some  of 
the  more  subtle  rules  that  we  found  useful. 

Figure  8  shows  a  fragment  of  the  actual  implementation  of  the  optimizer.  The  pattern¬ 
matching  features  of  Caml  Light  turned  out  to  be  particularly  useful  for  this  purpose.  By 
looking  at  the  example,  the  reader  can  convince  herself  that  a  sufficiently  powerful  optimizer 
can  be  implemented  quite  easily  [23,  page  108]. 

By  playing  with  the  optimizer  we  found  some  interesting  rules  to  make  the  codelets 
faster.  Consider  for  example  the  two  fragments  of  code  in  Figure  9.  At  first  glance,  it  ap¬ 
pears  that  the  two  fragments  should  perform  comparably.  After  all,  both  contain  the  same 
number  and  type  of  arithmetic  operations,  and  in  the  same  order  (subtractions  and  additions 
are  performed  by  the  same  hardware,  and  are  thus  equivalent  when  talking  about  perfor¬ 
mance).  The  fragment  on  the  right  executes  faster  on  all  processors  we  have  tried,  however. 
The  reason  is  that  floating-point  constants  are  not  created  out  of  thin  air,  but  must  be  stored 
somewhere  in  memory.  The  fragment  on  the  right  loads  the  constant  0.5  only  once,  while 
the  code  on  the  left  must  load  both  0.5  and  —0.5  from  memory.  As  a  rule,  the  optimizer 
makes  all  constants  positive  and  propagates  the  minus  sign  to  the  rest  of  the  AST.  We  found 
that  this  rule  typically  yielded  a  speed  improvement  of  about  10-15%. 

Another  interesting  result  that  arose  from  our  investigations  is  shown  in  Figure  10.  Con¬ 
ventional  wisdom  [24,  page  84]  dictates  that  the  common  subexpression  c  +  d  be  pre¬ 
computed  and  stored  into  a  temporary  variable,  as  shown  in  the  right  part  of  the  figure.  On 
the  contrary,  we  found  that  this  transformation  does  not  produce  faster  code  on  present-day 
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let  simplify_t imes  =  fun 

(Real  a)  (Real  b)  ->  (Real  (a  *.  b) ) 

|  a  (Real  b)  -> 

simplif y_t imes  (Real  b)  a 
|  (Uminus  a)  b  -> 

simplify  (Uminus  (Times  (a,b))) 

|  a  (Uminus  b)  -> 

simplify  (Uminus  (Times  (a,b))) 

|  (Real  a)  b  -> 

if  (almost_equal  a  0.0) 
then  (Real  0.0) 

else  if  (almost_equal  a  1.0)  then  b 
else  if  (almost_equal  a  (-1.0)) 
then  simplify  (Uminus  b) 
else  Times  ( (Real  a) ,  b) 

|  a  b  ->  Times  (a,  b) 

Figure  8:  Example  of  the  rules  that  constitute  the  optimizer.  The  function  shown  in  the 
figure  simplifies  the  product  of  two  factors.  If  both  factors  are  real  numbers,  the  optimizer 
replaces  the  multiplication  by  a  single  real  number.  Minus  signs  are  propagated  up,  so  that 
another  set  of  rules  (not  shown)  can  collapse  two  consecutive  minus  signs.  Multiplications 
by  constants  can  be  simplified  when  the  constant  is  0,  1  or  —1. 

compilers.  Indeed,  in  some  cases  we  found  that  the  elimination  of  the  common  subexpres¬ 
sion  produced  slower  code.  The  reason  for  this  behavior  is  not  clear.  From  our  understand¬ 
ing  at  this  point,  a  C  compiler  may  unnecessarily  waste  registers  when  temporary  variables 
are  declared  explicitly. 

4  Conclusions 

This  paper  described  the  design  and  the  performance  of  FFTW,  a  self-optimizing  library  for 
computing  the  one-  and  multidimensional  complex  discrete  Fourier  transform. 

The  current  version  of  FFTW  extends  the  program  described  in  this  paper  in  several 
directions.  We  have  written  three  parallel  versions,  using  Cilk  [5],  Posix  threads  [25]  and 
MPI  [26].  We  also  support  multidimensional  real-complex  transforms.  FFTW  has  now  a 
mechanism  to  save  plans  to  disk,  and  can  use  fragments  of  plans  in  order  to  reduce  the  plan¬ 
ning  time. 
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Figure  9:  Two  fragments  of  C  code  containing  the  same  number  and  type  of  arithmetic  op¬ 
erations,  in  the  same  order.  Nonetheless,  the  fragment  on  the  right  is  faster.  See  the  text  for 
an  explanation. 

{ 

double  tmp  = 

a  =  b  +  (c  +  d)  ;  c  +  d; 

e=f+(c+d);  a=b+  tmp; 

e  =  f  +  tmp; 

} 

Figure  10:  Two  equivalent  fragments  of  C  code;  the  fragment  on  the  right  explicitly  stores 
the  common  subexpression  into  a  temporary  variable.  We  found  that,  on  modern  compilers, 
the  fragment  on  the  left  is  not  slower  than  the  one  in  the  right,  and  in  some  cases  it  is  faster. 


FFTW  has  enjoyed  excellent  acceptance  in  the  Internet  community.  It  was  downloaded 
by  more  than  600  users  in  the  first  month  after  its  release,  many  of  whom  have  reported  sig¬ 
nificant  speedups  in  their  applications.  It  has  continued  to  gain  users,  and  is  now  part  of  the 
net  lib  repository  of  scientific  software.  FFTW  has  been  adopted  in  the  FFT  component 
of  the  Ptolemy  project  [27],  a  software  environment  for  signal  processing  and  simulation. 
In  addition,  the  VSIP  (Vector/S ignal/Image  Processing  Forum)  committee  is  discussing  the 
possibility  of  incorporating  FFTW  into  the  VSIP  reference  implementation  as  an  example 
of  how  to  use  FFTs  that  have  an  optimize/initialization  phase  before  first  use.  Their  goal  is 
to  define  an  industry- standard  API  for  vector,  signal,  and  image  processing  primitives  for 
embedded  real-time  signal  processing  systems. 
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